Optimize RequiredIndices column collection#19448
Optimize RequiredIndices column collection#19448feichai0017 wants to merge 11 commits intoapache:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR optimizes the RequiredIndices column collection in the projection optimizer by refactoring from a two-pass approach (using column_refs() + outer_columns()) to a single tree traversal. The goal is to reduce per-query overhead by eliminating temporary HashSet allocations and redundant expression walks.
Key changes:
- Refactored
add_expr()to use a singleTreeNode::apply()traversal instead of callingcolumn_refs()followed byouter_columns() - Added helper functions
collect_outer_ref_exprs()andpush_column_index()to support the new traversal approach - Removed unused
outer_columns()andouter_columns_helper_multi()functions from mod.rs along with the unusedHashSetimport
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| datafusion/optimizer/src/optimize_projections/required_indices.rs | Refactored add_expr() to use single-pass traversal with new helper functions; updated imports to include TreeNode |
| datafusion/optimizer/src/optimize_projections/mod.rs | Removed now-unused outer_columns() and outer_columns_helper_multi() functions; removed unused HashSet import |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@feichai0017 Do you have a micro benchmark for the change? |
|
Not yet, let me run micro benchmark to compare it with old implementation @xudong963 |
|
Ran the new required_indices micro-bench on an M3 Pro MacBook Pro (36 GB RAM). Results from the latest run:
|
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
TreeNode::applytraversal, including outer-ref subquery expressions.collect_outer_ref_exprs/push_column_index; remove unused outer_columns and redundant imports.Are these changes tested?
Are there any user-facing changes?